Backward anti-collision driving decision-making method for heavy commercial vehicle

ABSTRACT

The present invention discloses a backward anti-collision driving decision-making method for a heavy commercial vehicle. Firstly, a traffic environment model is established, and movement state information of a heavy commercial vehicle and a vehicle behind the heavy commercial vehicle is collected. Secondly, a backward collision risk assessment model based on backward distance collision time is established, and a backward collision risk is accurately quantified. Finally, a backward anti-collision driving decision-making problem is described as a Markov decision-making process under a certain reward function, a backward anti-collision driving decision-making model based on deep reinforcement learning is established, and an effective, reliable and adaptive backward anti-collision driving decision-making policy is obtained. The method provided by the present invention addresses the lack of research on backward anti-collision driving decision-making policies for heavy commercial vehicles in existing methods, can quantitatively output proper steering wheel angle and throttle opening control quantities, can provide effective and reliable backward anti-collision driving suggestions for a driver, and can reduce backward collision accidents.

TECHNICAL FIELD

The present invention relates to an anti-collision driving decision-making method, in particular to a backward anti-collision driving decision-making method for a heavy commercial vehicle, belonging to the technical field of automobile safety.

BACKGROUND

As the main carriers of road transportation, commercial vehicles directly affect the safety of road transportation through their safety status. Vehicle collision is the main form of accident in road transportation. Heavy commercial vehicles represented by dangerous goods transport tankers mostly carry flammable, explosive or highly toxic dangerous chemicals (for example, methanol and acrylonitrile). Compared with forward collision, backward collision is more likely to damage the tank, leading to serious consequences such as leakage, combustion and explosion of the dangerous goods in the tank, and the resulting secondary damage far exceeds the damage caused by the collision itself. As an important part of the active prevention and control of backward collision, driving decision-making can greatly reduce the frequency of traffic accidents caused by backward collision, or reduce the damage they cause, if it warns the driver before a backward collision occurs and reminds the driver to take proper measures such as acceleration or lane change. Therefore, research on a backward anti-collision driving decision-making method for heavy commercial vehicles has important social significance and practical value for ensuring road traffic safety.

At present, there are standards, patents and documents on vehicle backward collision prevention. In terms of standards, the Ministry of Transport issued the transportation industry standard Performance Requirements and Test Procedures for Backward Collision Early Warning Systems of Commercial Vehicles, stipulating the performance of backward collision early warning systems mounted on commercial vehicles. However, it is limited to the collision early warning level and does not involve backward anti-collision driving decision-making. In terms of patent documents, most of the research on backward collision prevention is for small passenger vehicles. Compared with passenger vehicles, heavy commercial vehicles have the characteristics of high centroid position, large load capacity and the like. During a sharp turn or an emergency lane change, the shaking of the tank or trailer further increases the instability of the vehicle, and the vehicle can easily lose stability and roll over. Therefore, it is difficult to apply a driving decision-making method for passenger vehicles to heavy commercial vehicles. Generally speaking, the existing research does not involve backward anti-collision driving decision-making for heavy commercial vehicles, and in particular there is a lack of research on backward anti-collision driving decision-making for heavy commercial vehicles that is effective, reliable and adaptive to traffic environment characteristics.

SUMMARY

Objectives of the invention: in order to realize backward anti-collision driving decision-making for a heavy commercial vehicle that is effective, reliable and adaptive to traffic environment characteristics, the present invention discloses a backward anti-collision driving decision-making method for a heavy commercial vehicle. The method addresses the lack of a backward anti-collision driving decision-making policy for heavy commercial vehicles in existing methods, can quantitatively output proper steering wheel angle and throttle opening control quantities, can provide effective and reliable backward anti-collision driving suggestions for a driver, and can realize backward anti-collision driving decision-making for the heavy commercial vehicle that is effective, reliable and adaptive to the traffic environment.

Technical solutions: the present invention provides a backward anti-collision driving decision-making method based on deep reinforcement learning for heavy commercial vehicles, such as semi-trailer tankers and semi-trailer trains. Firstly, a virtual traffic environment model is established, and movement state information of a heavy commercial vehicle and a vehicle behind the heavy commercial vehicle is collected. Secondly, a backward collision risk assessment model based on backward distance collision time is established, and a backward collision risk is accurately quantified. Finally, a backward anti-collision driving decision-making problem is described as a Markov decision-making process under a certain reward function, a backward anti-collision driving decision-making model based on deep reinforcement learning is established, and an effective, reliable and adaptive backward anti-collision driving decision-making policy is obtained. The method includes the following steps:

Step I: A Virtual Traffic Environment Model is Established.

In order to reduce the frequency of traffic accidents caused by backward collision and improve the safety of heavy commercial vehicles, the present invention provides a backward anti-collision driving decision-making method, and the method is applicable to the following scenario: there are no obstacles and other interference factors in front of a heavy commercial vehicle during the running of the vehicle, and in order to prevent backward collision with a vehicle behind, a decision-making policy such as acceleration and steering should be effectively provided for a driver in time to avoid collision accidents.

In the actual road test process, the relevant tests of heavy commercial vehicles have high test cost and risk. In order to reduce the test cost and risk while taking into account the test efficiency, the present invention establishes a virtual traffic environment model for high-class highways, that is, a three-lane virtual environment model including straight lanes and curved lanes. The heavy commercial vehicle moves in the traffic environment model, and a target vehicle (including 3 types: small, medium and large vehicles) follows the heavy commercial vehicle, and in the process, there are 4 different running conditions, including acceleration, deceleration, uniform velocity and lane change.

Movement state information can be obtained in real time through a centimeter-level high-precision differential GPS, an inertia measurement unit and a millimeter wave radar mounted on each vehicle, including positions, velocity, acceleration, relative distance and relative velocity of the two vehicles. A type of the target vehicle can be obtained in real time through a visual sensor mounted at a rear part of the vehicle. Driver control information can be read through a CAN bus, including throttle opening and steering wheel angle of the vehicle.

In the present invention, the target vehicle refers to a vehicle located behind the heavy commercial vehicle on a running road, located within the same lane line, running in the same direction and closest to the heavy commercial vehicle.

Step II: A Backward Collision Risk Assessment Model is Established.

In order to properly and effectively output the backward anti-collision decision-making policy, it is necessary to accurately assess the backward collision risk level of the heavy commercial vehicle in real time. Firstly, time required for collision between the heavy commercial vehicle and the target vehicle is calculated:

$\begin{matrix} {{{RTTC}(t)} = {- \frac{x_{c}(t)}{v_{r}(t)}}} & (1) \end{matrix}$

in formula (1), RTTC(t) represents backward distance collision time at time t in unit of second, x_(c)(t) represents vehicle distance in unit of meter, v_(F)(t) and v_(R)(t) respectively represent the velocity of the heavy commercial vehicle and the target vehicle, v_(r)(t) represents the relative velocity of the two vehicles in unit of meter per second, and v_(r)(t)=v_(F)(t)−v_(R)(t).

Secondly, a backward collision risk level is calculated. According to the national standard Performance Requirements and Test Procedures for Backward Collision Early Warning Systems of Commercial Vehicles, when the backward distance collision time is not less than 2.1 seconds and not more than 4.4 seconds, a backward collision alarm is given, indicating that a backward collision early warning system has passed a test. Based on this, the backward collision risk level is quantified:

$\begin{matrix} {\delta_{w} = \frac{{{RTTC}(t)} - 2.1}{4.4 - 2.1}} & (2) \end{matrix}$

in formula (2), δ_(w) represents a quantified value of a backward collision risk. When δ_(w)>1, it indicates that there is no backward collision risk; when 0.5≤δ_(w)≤1, it indicates that there is a backward collision risk; and when 0≤δ_(w)<0.5, it indicates that the backward collision risk level is very high.
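As an illustration only, formulas (1) and (2) can be sketched in Python as follows. The function names, the handling of a non-closing gap (infinite collision time), the assignment of δ_(w)=0.5 to the middle band and the treatment of δ_(w)<0 as very high risk are assumptions made for this sketch and are not stated in the text.

```python
import math

RTTC_LOWER_S = 2.1  # lower alarm bound quoted in the text, in seconds
RTTC_UPPER_S = 4.4  # upper alarm bound quoted in the text, in seconds


def rttc(distance_m, v_front_mps, v_rear_mps):
    """Formula (1): RTTC = -x_c / v_r with v_r = v_F - v_R."""
    v_rel = v_front_mps - v_rear_mps
    if v_rel >= 0:
        # Assumption: the rear vehicle is not closing in, so no finite collision time.
        return math.inf
    return -distance_m / v_rel


def risk_value(rttc_s):
    """Formula (2): delta_w = (RTTC - 2.1) / (4.4 - 2.1)."""
    return (rttc_s - RTTC_LOWER_S) / (RTTC_UPPER_S - RTTC_LOWER_S)


def risk_label(delta_w):
    """Map delta_w to the three bands described after formula (2)."""
    if delta_w > 1:
        return "no risk"
    if delta_w >= 0.5:
        return "risk"
    # Assumption: values below 0 (RTTC under 2.1 s) are also treated as very high.
    return "very high risk"


if __name__ == "__main__":
    t = rttc(distance_m=30.0, v_front_mps=20.0, v_rear_mps=30.0)
    d = risk_value(t)
    print(t, round(d, 3), risk_label(d))
```

In this example the rear vehicle closes a 30 m gap at 10 m/s, so the collision time is 3 s, which falls inside the alarm window and maps to the highest risk band.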

Step III: A Backward Anti-Collision Driving Decision-Making Model of the Heavy Commercial Vehicle is Established.

In order to realize backward anti-collision driving decision-making (the decision-making is effective, reliable and adaptive to the traffic environment), the present invention comprehensively considers the influence of traffic environment, vehicle operation state, rear vehicle type and backward collision risk level on backward collision, and establishes a backward anti-collision driving decision-making model of the heavy commercial vehicle.

Common driving decision-making methods include rule-based and data-learning-based decision-making algorithms. (1) The rule-based decision-making algorithm uses a finite directed connected graph to describe different driving states and the transition relationship between states, so as to generate driving actions according to the migration of driving states. However, in the process of vehicle movement, there are uncertainties in vehicle movement parameters, road conditions and rear traffic conditions. It is difficult for predefined rules to cover all scenarios, and therefore difficult to ensure the effectiveness and adaptability of decision-making. (2) The data-learning-based decision-making algorithm imitates the way humans acquire knowledge or skills, and continuously improves its own learning performance through an interactive self-learning mechanism. The method based on deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning, and because it adapts well to uncertain problems, it meets the requirement that anti-collision decision-making adapt to the traffic environment and running conditions. Therefore, the present invention adopts the deep reinforcement learning algorithm to establish the backward anti-collision driving decision-making model.

The decision-making methods based on deep reinforcement learning mainly include decision-making methods based on value function, policy search and Actor-Critic architecture. The value-based deep reinforcement learning algorithm cannot deal with the problem of continuous output, and cannot meet the need of continuous output of driving policies in anti-collision decision-making. The method based on policy search is sensitive to the step size, which is difficult to choose. The decision-making method based on Actor-Critic architecture combines value function estimation and policy search, and is fast in update speed. Proximal Policy Optimization (PPO) solves the problems of slow parameter update and difficulty in determining the step size, and achieves good results in outputting continuous action spaces. Therefore, the present invention adopts a PPO algorithm to establish the backward anti-collision driving decision-making model, and obtains the optimal backward anti-collision decision through interactive iterative learning with a target vehicle movement random process model. This step specifically includes the following 4 sub-steps:

Sub-Step 1: Basic Parameters of the Backward Anti-Collision Driving Decision-Making Model are Defined.

Firstly, a backward anti-collision driving decision-making problem is described as a Markov decision-making process (S,A,P,r) under a certain reward function, wherein S is a state space, A is a backward anti-collision action decision, P is a state transition probability caused by movement uncertainty of the target vehicle, and r is a reward function. Secondly, basic parameters of the Markov decision-making process are defined specifically as follows:

(1) A State Space is Defined.

A state space expression is established by using the vehicle movement state information output in step I and the backward collision risk level output in step II:

S_(t)=(v_(F_lon),a_(F_lon),v_(r_lon),a_(r_lon),θ_(str),p_(thr),L_(r),δ_(w),T_(m))  (3)

in formula (3), S_(t) represents a state space at time t, v_(F_lon) and v_(r_lon) respectively represent the longitudinal velocity of the heavy commercial vehicle and the relative longitudinal velocity of the two vehicles in unit of meter per second, a_(F_lon) and a_(r_lon) respectively represent the longitudinal acceleration of the heavy commercial vehicle and the relative longitudinal acceleration of the two vehicles in unit of meter per square second, θ_(str) represents a steering wheel angle of the vehicle in unit of degree, p_(thr) represents a throttle opening in unit of percentage, L_(r) represents a relative vehicle distance in unit of meter, δ_(w) and T_(m) respectively represent the backward collision risk level and the type of the target vehicle, m=1,2,3 respectively represent that the target vehicle is a large vehicle, a medium vehicle and a small vehicle, and T_(m)=m in the present invention.

(2) An Action Decision is Defined.

In order to comprehensively consider the influence of transverse movement and longitudinal movement on backward collision, the present invention defines a driving policy, that is, an action decision output by the decision-making model, by using the steering wheel angle and the throttle opening as control quantities in the present invention:

A_(t)=[θ_(str_out),p_(thr_out)]  (4)

in formula (4), A_(t) represents an action decision at time t, θ_(str_out) represents a normalized steering wheel angle control quantity in a range of [−1, 1], and p_(thr_out) represents a normalized throttle opening control quantity in a range of [0, 1]. When p_(thr_out)=0, it indicates that the vehicle does not accelerate, and when p_(thr_out)=1, it indicates that the vehicle accelerates at a maximum acceleration.
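A minimal sketch of how the normalized action of formula (4) might be bounded and converted into physical control quantities is given below. The helper names and the `max_steer_deg` parameter are hypothetical, since the text does not give the physical steering range of the vehicle.

```python
def clamp(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def to_action(raw_steer, raw_throttle):
    """Bound raw network outputs to the ranges stated for formula (4)."""
    theta_str_out = clamp(raw_steer, -1.0, 1.0)   # normalized steering wheel angle
    p_thr_out = clamp(raw_throttle, 0.0, 1.0)     # normalized throttle opening
    return (theta_str_out, p_thr_out)


def to_physical(action, max_steer_deg):
    """Scale a normalized action to degrees and percent.

    max_steer_deg is a hypothetical vehicle-specific limit, not from the text.
    """
    theta_str_out, p_thr_out = action
    return (theta_str_out * max_steer_deg, p_thr_out * 100.0)


if __name__ == "__main__":
    a = to_action(1.7, -0.2)
    print(a, to_physical((0.5, 0.25), max_steer_deg=540.0))
```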

(3) A Reward Function is Established.

In order to evaluate the advantages and disadvantages of the action decision, a reward function is established to concretize and digitalize the evaluation. Considering that backward anti-collision driving decision-making is a multi-objective optimization problem involving safety, comfort and other objectives, the present invention designs the reward function as follows:

r_(t)=r₁+r₂+r₃  (5)

in formula (5), r_(t) represents a reward function at time t, r₁ represents a safety distance reward function, r₂ represents a comfort reward function, and r₃ represents a penalty function.

Firstly, a safety distance reward function r₁ is designed:

$\begin{matrix} {r_{1} = \left\{ \begin{matrix} {\omega_{d}\left( {L_{r} - L_{s}} \right)} & {L_{r} \leq L_{s}} \\ 0 & {L_{r} > L_{s}} \end{matrix} \right.} & (6) \end{matrix}$

in formula (6), L_(r) and L_(s) respectively represent relative vehicle distance and a safety distance threshold, and ω_(d) represents a safety distance weight coefficient valued as ω_(d)=0.85 in the present invention.

Secondly, in order to ensure the driving comfort of the vehicle, excessive impact should be avoided as much as possible, and a comfort reward function r₂ is designed:

r₂=ω_(j)|a_(F_lon)(t+1)−a_(F_lon)(t)|  (7)

in formula (7), ω_(j) is a comfort weight coefficient valued as ω_(j)=0.95 in the present invention.

Finally, a penalty function r₃ is designed:

$\begin{matrix} {r_{3} = \left\{ \begin{matrix} {- 100} & \text{collision} \\ {- 100} & \text{rollover} \\ 0 & \text{no collision or rollover} \end{matrix} \right.} & (8) \end{matrix}$
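The reward terms of formulas (5) to (8) can be sketched as follows. This is only an illustrative sketch: it assumes that the first branch of formula (6) applies when L_(r) does not exceed L_(s), it takes the comfort term exactly as written in formula (7), and the safety distance threshold is passed in by the caller because the text gives no value for it. All function names are hypothetical.

```python
OMEGA_D = 0.85          # safety distance weight coefficient from the text
OMEGA_J = 0.95          # comfort weight coefficient from the text
EVENT_PENALTY = -100.0  # penalty for collision or rollover, formula (8)


def safety_distance_reward(l_r, l_s):
    """Formula (6), assuming the first branch covers L_r <= L_s."""
    return OMEGA_D * (l_r - l_s) if l_r <= l_s else 0.0


def comfort_reward(a_lon_now, a_lon_next):
    """Formula (7) as written: weighted change of longitudinal acceleration."""
    return OMEGA_J * abs(a_lon_next - a_lon_now)


def penalty(collision, rollover):
    """Formula (8): -100 on collision or rollover, otherwise 0."""
    return EVENT_PENALTY if (collision or rollover) else 0.0


def total_reward(l_r, l_s, a_lon_now, a_lon_next, collision=False, rollover=False):
    """Formula (5): r_t = r_1 + r_2 + r_3."""
    return (safety_distance_reward(l_r, l_s)
            + comfort_reward(a_lon_now, a_lon_next)
            + penalty(collision, rollover))


if __name__ == "__main__":
    print(total_reward(l_r=8.0, l_s=10.0, a_lon_now=0.5, a_lon_next=1.0))
```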

(4) An Expected Maximum Policy is Designed.

$\begin{matrix} {\pi^{*} = {\arg\max\limits_{\pi_{\theta}}{E_{\tau(\pi_{\theta})}\left\lbrack {{\sum}_{t = 0}\gamma^{t}r_{t}} \right\rbrack}}} & (9) \end{matrix}$

in formula (9), π* is an expected maximum policy, π_(θ) is a backward anti-collision decision-making policy, γ is a discount factor, γ∈(0,1), and τ(π_(θ)) represents trajectory distribution under policy π_(θ).

Sub-Step 2: A Network Architecture of the Backward Anti-Collision Driving Decision-Making Model is Designed.

A backward anti-collision driving decision-making network is set up by using an “Actor-Critic” network framework, including an Actor network and a Critic network. The Actor network uses state space information as an input and outputs an action decision, that is, the throttle opening and steering wheel angle control quantities of the heavy commercial vehicle. The Critic network uses the state space information and the action decision as an input and outputs a value of current “state-action”. The process is specifically as follows:

(1) An Actor Network is Designed.

A hierarchical coder structure is established, and features of various information in the state space are respectively extracted. Firstly, 3 serially connected convolution layers (C_(F1), C_(F2), C_(F3)) and 1 maximum pooling layer (P₁) are constructed, features of the movement state information (longitudinal velocity, longitudinal acceleration, steering wheel angle, and throttle opening) of the vehicle are extracted, and they are coded into an intermediate feature vector h₁; features of the relative movement state information (relative longitudinal velocity, relative longitudinal acceleration, and relative vehicle distance) of the two vehicles are extracted by using the same structure, that is, 3 serially connected convolution layers (C_(R1), C_(R2), C_(R3)) and 1 maximum pooling layer (P₂), and they are coded into an intermediate feature vector h₂; and features of the collision risk level and the type of the target vehicle are extracted by using a convolution layer C_(W1) and a maximum pooling layer P₃, and they are coded into an intermediate feature vector h₃. Secondly, the features h₁, h₂ and h₃ are combined and full connection layers FC₄ and FC₅ are connected to output the action decision.

The number of neurons of the convolution layers C_(F1), C_(F2), C_(F3), C_(R1), C_(R2), C_(R3) and C_(W1) is set to be 20, 20, 10, 20, 20, 10 and 20 respectively; and the number of neurons of the full connection layers FC₄ and FC₅ is set to be 200. The activation function of each convolution layer and full connection layer is a Rectified Linear Unit (ReLU), and an expression thereof is f(x)=max(0,x).

(2) A Critic Network is Designed.

A critic network is established by using a neural network with a multiple hidden layer structure. Firstly, a state space S_(t) is input into a hidden layer FC_(C1); and at the same time, an action decision A_(t) is input into a hidden layer FC_(C2). Secondly, the hidden layers FC_(C1) and FC_(C2) are combined by tensor addition. Finally, after passing through the full connection layers FC_(C3) and FC_(C4) sequentially, a value of the Critic network is output.

The number of neurons of the layers FC_(C1) and FC_(C2) is set to be 400, the number of neurons of other hidden layers is set to be 200, and the activation function of each layer is an ReLU.
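The data flow of the Critic network described above can be sketched in plain Python as follows. Several points are assumptions for illustration and are not specified in the text: the random weight initialization, applying ReLU to FC_(C1) and FC_(C2) before the addition, treating FC_(C4) as a 200-neuron layer, and adding a final linear readout (called `readout` here) to produce the scalar value.

```python
import math
import random

STATE_DIM = 9   # number of components of S_t in formula (3)
ACTION_DIM = 2  # number of components of A_t in formula (4)


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights (illustrative only)."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    """f(x) = max(0, x), applied element-wise."""
    return [max(0.0, x) for x in v]


def build_critic(seed=0):
    rng = random.Random(seed)
    return {
        "fc_c1": init_layer(STATE_DIM, 400, rng),
        "fc_c2": init_layer(ACTION_DIM, 400, rng),
        "fc_c3": init_layer(400, 200, rng),
        "fc_c4": init_layer(200, 200, rng),
        "readout": init_layer(200, 1, rng),  # assumption: scalar output head
    }


def critic_value(params, state, action):
    h_state = relu(dense(state, params["fc_c1"]))
    h_action = relu(dense(action, params["fc_c2"]))
    h = [s + a for s, a in zip(h_state, h_action)]  # tensor addition
    h = relu(dense(h, params["fc_c3"]))
    h = relu(dense(h, params["fc_c4"]))
    return dense(h, params["readout"])[0]


if __name__ == "__main__":
    critic = build_critic(seed=0)
    s_t = [20.0, 0.1, -3.0, 0.0, 2.0, 30.0, 25.0, 0.4, 1.0]
    a_t = [0.1, 0.6]
    print(critic_value(critic, s_t, a_t))
```

A deep learning framework would normally replace these hand-written layers; the pure-Python form is used only to keep the sketch self-contained.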

Sub-Step 3: The Backward Anti-Collision Driving Decision-Making Model is Trained.

Gradient updating is performed to the network parameters by using loss functions J_(actor) and J_(critic). A specific training process is as follows:

Sub-step 3.1: the Actor network and the Critic network are initialized.

Sub-step 3.2: iterative solution is performed, wherein each iteration includes sub-step 3.21 to sub-step 3.23 specifically as follows:

Sub-step 3.21: iterative solution is performed, wherein each iteration includes sub-step 3.211 to sub-step 3.213 specifically as follows:

Sub-step 3.211: a movement control operation of the vehicle is obtained by using the virtual traffic environment model in step I.

Sub-step 3.212: sample data (S_(t),A_(t),r_(t)) are obtained by using the Actor network.

Sub-step 3.213: a cycle is ended to obtain a sample point set [(S₁,A₁,r₁), (S₂,A₂,r₂), . . . , (S_(t),A_(t),r_(t))].

Sub-step 3.22: an advantage function is calculated:

$\begin{matrix} {{\hat{F}}_{t} = {{\sum\limits_{t^{\prime} > t}{\gamma^{t^{\prime} - 1}r_{t^{\prime}}}} - {V\left( S_{t} \right)}}} & (10) \end{matrix}$

in formula (10), F̂_(t) represents an advantage function, V(S_(t)) represents a value function of state S_(t), F̂_(t)>0 represents that the possibility of taking a current action should be increased, and F̂_(t)<0 represents that the possibility of taking the action should be decreased.
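The advantage of formula (10) can be computed for a whole sample point set in one backward pass, as in the sketch below. It follows the formula exactly as written (discount exponent t′−1, sum over t′>t, time steps numbered from 1). The function name and the list-based interface are assumptions of this sketch.

```python
def advantages(rewards, values, gamma):
    """Formula (10): F_t = sum_{t' > t} gamma^(t'-1) * r_t' - V(S_t).

    rewards[i] and values[i] hold r_t and V(S_t) for t = i + 1.
    """
    horizon = len(rewards)
    result = [0.0] * horizon
    tail = 0.0  # running value of sum_{t' > t} gamma^(t'-1) * r_t'
    for t in range(horizon, 0, -1):  # t = T, T-1, ..., 1
        result[t - 1] = tail - values[t - 1]
        tail += gamma ** (t - 1) * rewards[t - 1]
    return result


if __name__ == "__main__":
    print(advantages([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], gamma=0.5))
    # prints [1.25, 0.25, -0.5]
```

For the last sample the sum over t′>t is empty, so its advantage reduces to −V(S_(t)).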

Sub-step 3.23: iterative solution is performed, wherein each iteration includes sub-step 3.231 to sub-step 3.233 specifically as follows:

Sub-step 3.231: an objective function of the Actor network is calculated.

Sub-step 3.232: the parameters of the Actor network are updated by using the loss function J_(actor):

$\begin{matrix} {J_{actor} = {\sum\limits_{({s_{1},a_{1}})}{\min\left\lbrack {{{p_{t}(\theta)}{\hat{F}}_{t}},{{{clip}\left( {{p_{t}(\theta)},{1 - \varepsilon},{1 + \varepsilon}} \right)}{\hat{F}}_{t}}} \right\rbrack}}} & (11) \end{matrix}$

in formula (11), p_(t)(θ) represents a ratio of a new policy π_(θ) to an old policy π_(θ_old) on action decision distribution in a policy updating process,

${{p_{t}(\theta)} = \frac{\pi_{\theta}\left( {A_{t}❘S_{t}} \right)}{\pi_{{\theta\_}{old}}\left( {A_{t}❘S_{t}} \right)}},$

clip(⋅) represents a clipping function, and ε is a constant valued as ε=0.25 in the present invention.
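The clipped objective of formula (11) can be sketched as follows, with ε=0.25 as stated above. The sketch assumes the policy ratios are obtained from log-probabilities of the new and old policies, which is a common way to compute them but is not described in the text. The function names are hypothetical.

```python
import math

EPSILON = 0.25  # clipping constant from the text


def clip(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def policy_ratio(log_prob_new, log_prob_old):
    """p_t(theta) = pi_theta(A_t|S_t) / pi_theta_old(A_t|S_t), via log-probabilities."""
    return math.exp(log_prob_new - log_prob_old)


def clipped_objective(ratios, advantages, epsilon=EPSILON):
    """Formula (11): sum of min(p_t * F_t, clip(p_t, 1 - eps, 1 + eps) * F_t)."""
    total = 0.0
    for p_t, f_t in zip(ratios, advantages):
        unclipped = p_t * f_t
        clipped = clip(p_t, 1.0 - epsilon, 1.0 + epsilon) * f_t
        total += min(unclipped, clipped)
    return total


if __name__ == "__main__":
    print(clipped_objective([1.5, 0.5, 1.0], [2.0, -1.0, 1.0]))
    # prints 2.75
```

In the example, the first ratio is clipped down to 1.25 and the second up to 0.75, so a single update cannot move the policy far from the old one in either direction.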

Sub-step 3.233: the parameters of the Critic network are updated by using the loss function J_(critic):

$\begin{matrix} {J_{critic} = {- {\sum\limits_{t = 1}^{T}\left\lbrack {{\sum\limits_{t^{\prime} > t}{\gamma^{t^{\prime} - 1}r_{t^{\prime}}}} - {V\left( S_{t} \right)}} \right\rbrack^{2}}}} & (12) \end{matrix}$

Sub-step 3.234: a cycle is ended.

Sub-step 3.3: iterative updating is performed according to the method provided in sub-step 3.2 to make the Actor network and the Critic network converge gradually. In a training process, if the vehicle has a backward collision or rollover, a current round is terminated and a new round for training is started. When the iteration reaches the maximum number of steps or the model is capable of making a backward anti-collision driving decision stably and accurately, the training ends.

Sub-Step 4: The Decision-Making Policy is Output by Using the Backward Anti-Collision Decision-Making Model.

The information obtained by the centimeter-level high-precision differential GPS, the inertia measurement unit, the millimeter wave radar and the CAN bus is input into the trained backward anti-collision driving decision-making model, such that proper steering wheel angle and throttle opening control quantities are capable of being quantitatively output to provide an effective and reliable backward anti-collision driving suggestion for a driver, so as to realize effective, reliable and adaptive backward anti-collision driving decision-making of the heavy commercial vehicle.

Beneficial Effects

Compared with the existing technology, the technical solution of the present invention has the following beneficial technical effects, which are specifically embodied as follows:

(1) The method provided by the present invention realizes the backward anti-collision driving decision-making of the heavy commercial vehicle, and can provide an effective and reliable backward anti-collision driving decision-making policy for a driver.

(2) The method provided by the present invention comprehensively considers the influence of traffic environment, vehicle operation state, rear vehicle type and backward collision risk level on backward collision, and accurately quantifies the driving policy, such as steering wheel angle and throttle opening, in numerical form. The output driving policy can be adjusted adaptively according to the traffic environment and the driver's operation, thus improving the effectiveness, reliability and environmental adaptability of backward anti-collision driving decision-making of the heavy commercial vehicle.

(3) The method provided by the present invention does not need to consider complex vehicle dynamic equations and body parameters. The calculation method is simple and clear.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a technical route according to the present invention.

FIG. 2 is a schematic diagram of a network architecture of a backward anti-collision driving decision-making model established according to the present invention.

DETAILED DESCRIPTION

The technical solutions of the present invention are further described below with reference to the accompanying drawings.

In order to establish a backward anti-collision decision-making policy that is effective, reliable and adaptive to the traffic environment, and to realize backward anti-collision driving decision-making of heavy commercial vehicles, so as to fill the gap in backward anti-collision driving decision-making technology for heavy commercial vehicles in practical application, the present invention provides a backward anti-collision driving decision-making method based on deep reinforcement learning for heavy commercial vehicles, such as semi-trailer tankers and semi-trailer trains. Firstly, a virtual traffic environment model is established, and movement state information of a heavy commercial vehicle and a vehicle behind the heavy commercial vehicle is collected. Secondly, a backward collision risk assessment model based on backward distance collision time is established, and a backward collision risk is accurately quantified. Finally, a backward anti-collision driving decision-making problem is described as a Markov decision-making process under a certain reward function, a backward anti-collision driving decision-making model based on deep reinforcement learning is established, and an effective, reliable and adaptive backward anti-collision driving decision-making policy is obtained. A technical route according to the present invention is as illustrated in FIG. 1. Specific steps are as follows:

Step I: A Virtual Traffic Environment Model is Established.

In order to reduce the frequency of traffic accidents caused by backward collision and improve the safety of heavy commercial vehicles, the present invention provides a backward anti-collision driving decision-making method, and the method is applicable to the following scenario: there are no obstacles and other interference factors in front of a heavy commercial vehicle during the running of the vehicle, and in order to prevent backward collision with a vehicle behind, a decision-making policy such as acceleration and steering should be effectively provided for a driver in time to avoid collision accidents.

In the actual road test process, the anti-collision tests of heavy commercial vehicles have high test cost and risk. In order to reduce the test cost and risk while taking into account the test efficiency, the present invention establishes a virtual traffic environment model for high-class highways, that is, a three-lane virtual environment model including straight lanes and curved lanes. The heavy commercial vehicle moves in the traffic environment model, and a target vehicle (including 3 types: small, medium and large vehicles) follows the heavy commercial vehicle, and in the process, there are 4 different running conditions, including acceleration, deceleration, uniform velocity and lane change.

Movement state information can be obtained in real time through a centimeter-level high-precision differential GPS, an inertia measurement unit and a millimeter wave radar mounted on each vehicle, including positions, velocity, acceleration, relative distance and relative velocity of the two vehicles. A type of the target vehicle can be obtained in real time through a visual sensor mounted at a rear part of the vehicle. Driver control information can be read through a CAN bus, including throttle opening and steering wheel angle of the vehicle.

In the present invention, the target vehicle refers to a vehicle located behind the heavy commercial vehicle on a running road, located within the same lane line, running in the same direction and closest to the heavy commercial vehicle.

Step II: A Backward Collision Risk Assessment Model is Established.

In order to reasonably and effectively output the backward anti-collision decision-making policy, it is necessary to accurately assess the backward collision risk level of the heavy commercial vehicle in real time. Firstly, time required for collision between the heavy commercial vehicle and the target vehicle is calculated:

$\begin{matrix} {{{RTTC}(t)} = {- \frac{x_{c}(t)}{v_{r}(t)}}} & (1) \end{matrix}$

in formula (1), RTTC(t) represents backward distance collision time at time t in unit of second, x_(c)(t) represents a vehicle distance in unit of meter, v_(F)(t) and v_(R)(t) respectively represent the velocity of the heavy commercial vehicle and the target vehicle, v_(r)(t) represents the relative velocity of the two vehicles in unit of meter per second, and v_(r)(t)=v_(F)(t)−v_(R)(t).

Secondly, a backward collision risk level is calculated. According to the national standard Performance Requirements and Test Procedures for Backward Collision Early Warning Systems of Commercial Vehicles, when the backward distance collision time is not less than 2.1 seconds and not more than 4.4 seconds, a backward collision alarm is given, indicating that a backward collision early warning system has passed a test. Based on this, the backward collision risk level is quantified:

$\begin{matrix} {\delta_{w} = \frac{{{RTTC}(t)} - 2.1}{4.4 - 2.1}} & (2) \end{matrix}$

in formula (2), δ_(w) represents a quantified value of a backward collision risk. When δ_(w)>1, it indicates that there is no backward collision risk; when 0.5≤δ_(w)≤1, it indicates that there is a backward collision risk; and when 0≤δ_(w)<0.5, it indicates that the backward collision risk level is very high.

Step III: A Backward Anti-Collision Driving Decision-Making Model of the Heavy Commercial Vehicle is Established.

In order to realize backward anti-collision driving decision-making (the decision-making is effective, reliable and adaptive to the traffic environment), the present invention comprehensively considers the influence of traffic environment, vehicle operation state, rear vehicle type and backward collision risk level on backward collision, and establishes a backward anti-collision driving decision-making model of the heavy commercial vehicle.

Common driving decision-making methods include rule-based and data-learning-based decision-making algorithms. (1) The rule-based decision-making algorithm uses a finite directed connected graph to describe different driving states and the transition relationships between states, so as to generate driving actions according to the transitions between driving states. However, during vehicle movement there are uncertainties in vehicle movement parameters, road conditions and rear traffic conditions. Predefined rules can hardly cover all scenarios, so the effectiveness and adaptability of decision-making are difficult to ensure. (2) The data-learning-based decision-making algorithm imitates the way humans acquire knowledge or skills, and continuously improves its own learning performance through an interactive self-learning mechanism. The method based on deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning; because it adapts well to uncertain problems, it can meet the requirement that anti-collision decision-making adapt to the traffic environment and running conditions. Therefore, the present invention adopts the deep reinforcement learning algorithm to establish the backward anti-collision driving decision-making model.

The decision-making methods based on deep reinforcement learning mainly include decision-making methods based on value function, policy search and Actor-Critic architecture. The value-based deep reinforcement learning algorithm cannot handle continuous outputs, and therefore cannot meet the need for continuous output of driving policies in anti-collision decision-making. The method based on policy search is sensitive to the step size, which is difficult to choose. The decision-making method based on Actor-Critic architecture combines value function estimation and policy search, and updates quickly. Proximal Policy Optimization (PPO) solves the problems of slow parameter update and difficulty in determining the step size, and achieves good results in outputting continuous action spaces. Therefore, the present invention adopts a PPO algorithm to establish the backward anti-collision driving decision-making model, and obtains the optimal backward anti-collision decision through interactive iterative learning with a target vehicle movement random process model. This step specifically includes the following 4 sub-steps:

Sub-Step 1: Basic Parameters of the Backward Anti-Collision Driving Decision-Making Model are Defined.

Firstly, a backward anti-collision driving decision-making problem is described as a Markov decision-making process (S,A,P,r) under a certain reward function, wherein S is a state space, A is a backward anti-collision action decision, P is a state transition probability caused by movement uncertainty of the target vehicle, and r is a reward function. Secondly, basic parameters of the Markov decision-making process are defined specifically as follows:

(1) A State Space is Defined.

A state space expression is established by using the vehicle movement state information output in step I and the backward collision risk level output in step II:

S_(t)=(v_(F_lon),a_(F_lon),v_(r_lon),a_(r_lon),θ_(str),p_(thr),L_(r),δ_(w),T_(m))  (3)

in formula (3), S_(t) represents a state space at time t, v_(F_lon) and v_(r_lon) respectively represent the longitudinal velocity of the heavy commercial vehicle and the relative longitudinal velocity of the two vehicles in units of meters per second, a_(F_lon) and a_(r_lon) respectively represent the longitudinal acceleration of the heavy commercial vehicle and the relative longitudinal acceleration of the two vehicles in units of meters per second squared, θ_(str) represents a steering wheel angle of the vehicle in units of degrees, p_(thr) represents a throttle opening as a percentage, L_(r) represents a relative vehicle distance in units of meters, δ_(w) and T_(m) respectively represent the backward collision risk level and the type of the target vehicle, m=1,2,3 respectively represent that the target vehicle is a large vehicle, a medium vehicle and a small vehicle, and T_(m)=m in the present invention.
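As a minimal sketch of how the nine components of formula (3) might be packed into one vector, the following Python fragment may help. The class name, field names and the `TARGET_TYPE` mapping keys are hypothetical; only the component order, units and the coding T_(m)=m come from the text.

```python
from dataclasses import dataclass, astuple

# T_m = m with m = 1, 2, 3 for large, medium and small target vehicles.
TARGET_TYPE = {"large": 1, "medium": 2, "small": 3}


@dataclass(frozen=True)
class State:
    v_f_lon: float    # longitudinal velocity of the heavy vehicle, m/s
    a_f_lon: float    # longitudinal acceleration of the heavy vehicle, m/s^2
    v_r_lon: float    # relative longitudinal velocity, m/s
    a_r_lon: float    # relative longitudinal acceleration, m/s^2
    theta_str: float  # steering wheel angle, degrees
    p_thr: float      # throttle opening, percent
    l_r: float        # relative vehicle distance, m
    delta_w: float    # backward collision risk level from formula (2)
    t_m: int          # target vehicle type code

    def as_vector(self) -> list:
        """Return the nine components as floats in the order of formula (3)."""
        return [float(x) for x in astuple(self)]
```

A state for a small target vehicle could then be built as `State(20.0, 0.1, -5.0, -0.2, 1.5, 30.0, 20.0, 0.83, TARGET_TYPE["small"])`.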

(2) An Action Decision is Defined.

In order to comprehensively consider the influence of transverse movement and longitudinal movement on backward collision, the present invention defines a driving policy, that is, an action decision output by the decision-making model, by using the steering wheel angle and the throttle opening as control quantities in the present invention:

A_(t)=[θ_(str_out),p_(thr_out)]  (4)

in formula (4), A_(t) represents an action decision at time t, θ_(str_out) represents a normalized steering wheel angle control quantity in a range of [−1, 1], and p_(thr_out) represents a normalized throttle opening control quantity in a range of [0, 1]. When p_(thr_out)=0, it indicates that the vehicle does not accelerate, and when p_(thr_out)=1, it indicates that the vehicle accelerates at a maximum acceleration.

(3) A Reward Function is Established.

In order to evaluate the advantages and disadvantages of the action decision, a reward function is established to concretize and digitalize the evaluation. Considering that backward anti-collision driving decision-making is a multi-objective optimization problem involving safety, comfort and other objectives, the present invention designs the reward function as follows:

r_(t)=r₁+r₂+r₃  (5)

in formula (5), r_(t) represents a reward function at time t, r₁ represents a safety distance reward function, r₂ represents a comfort reward function, and r₃ represents a penalty function.

Firstly, a safety distance reward function r₁ is designed:

$\begin{matrix} {r_{1} = \left\{ \begin{matrix} {\omega_{d}\left( {L_{r} - L_{s}} \right)} & {L_{r} \leq L_{s}} \\ 0 & {L_{r} > L_{s}} \end{matrix} \right.} & (6) \end{matrix}$

in formula (6), L_(r) and L_(s) respectively represent relative vehicle distance and a safety distance threshold, and ω_(d) represents a safety distance weight coefficient valued as ω_(d)=0.85 in the present invention.

Secondly, in order to ensure the driving comfort of the vehicle, excessive impact should be avoided as much as possible, and a comfort reward function r₂ is designed:

r₂=ω_(j)|a_(F_lon)(t+1)−a_(F_lon)(t)|  (7)

in formula (7), ω_(j) is a comfort weight coefficient valued as ω_(j)=0.95 in the present invention.

Finally, a penalty function r₃ is designed:

$\begin{matrix} {r_{3} = \left\{ \begin{matrix} {- 100,} & \text{collision} \\ {- 100,} & \text{rollover} \\ {0,} & \text{no collision or rollover} \end{matrix} \right.} & (8) \end{matrix}$
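The three reward terms of formulas (5) to (8) can be sketched in Python as follows. This is an illustrative sketch, not a definitive implementation: the function names are hypothetical, the safety distance threshold L_(s) is passed in by the caller because no value is given in the text, and the first branch of formula (6) is assumed to apply when L_(r)≤L_(s). The comfort term is coded exactly as printed in formula (7).

```python
OMEGA_D = 0.85  # safety distance weight coefficient, formula (6)
OMEGA_J = 0.95  # comfort weight coefficient, formula (7)
COLLISION_PENALTY = -100.0
ROLLOVER_PENALTY = -100.0


def safety_reward(l_r: float, l_s: float) -> float:
    """r1: assumed to be omega_d * (L_r - L_s) when L_r <= L_s, otherwise 0."""
    return OMEGA_D * (l_r - l_s) if l_r <= l_s else 0.0


def comfort_reward(a_next: float, a_now: float) -> float:
    """r2 as printed in formula (7): omega_j * |a(t+1) - a(t)|."""
    return OMEGA_J * abs(a_next - a_now)


def penalty(collision: bool, rollover: bool) -> float:
    """r3 from formula (8)."""
    if collision:
        return COLLISION_PENALTY
    if rollover:
        return ROLLOVER_PENALTY
    return 0.0


def reward(l_r, l_s, a_next, a_now, collision=False, rollover=False) -> float:
    """r_t = r1 + r2 + r3, formula (5)."""
    return (safety_reward(l_r, l_s)
            + comfort_reward(a_next, a_now)
            + penalty(collision, rollover))
```

With a hypothetical threshold L_(s)=20 m and a relative distance of 15 m, the safety term evaluates to 0.85×(15−20)=−4.25.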

(4) An Expected Maximum Policy is Designed.

$\begin{matrix} {\pi^{*} = {\arg\max\limits_{\pi_{\theta}}E_{\tau(\pi_{\theta})}\left\lbrack {\sum\limits_{t = 0}{\gamma^{t}r_{t}}} \right\rbrack}} & (9) \end{matrix}$

in formula (9), π* is an expected maximum policy, π is a backward anti-collision decision-making policy, γ is a discount factor, γ∈(0,1), and τ(π) represents the trajectory distribution under policy π.
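The quantity inside the expectation of formula (9) is a discounted sum of rewards along one trajectory. A minimal Python sketch is given below, assuming that γ is applied as a power of the time index and that the expectation is approximated by an average over sampled trajectories; both function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory, with t starting at 0."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate of the expectation: mean return over trajectories."""
    returns = [discounted_return(tr, gamma) for tr in trajectories]
    return sum(returns) / len(returns)
```

For the reward sequence [1, 1, 1] and γ=0.5, the discounted return is 1+0.5+0.25=1.75.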

Sub-Step 2: A Network Architecture of the Backward Anti-Collision Driving Decision-Making Model is Designed.

A backward anti-collision driving decision-making network is set up by using an “Actor-Critic” network framework, including an Actor network and a Critic network. The Actor network uses state space information as an input and outputs an action decision, that is, the throttle opening and steering wheel angle control quantities of the heavy commercial vehicle. The Critic network uses the state space information and the action decision as an input and outputs a value of current “state-action”. The network architecture is as illustrated in FIG. 2. The specific steps are as follows:

(1) An Actor Network is Designed.

A hierarchical coder structure is established and features of various information in the state space are respectively extracted. Firstly, 3 serially connected convolution layers (C_(F1), C_(F2), C_(F3)) and 1 maximum pooling layer (P₁) are constructed, features of the movement state information (longitudinal velocity, longitudinal acceleration, steering wheel angle, and throttle opening) of the vehicle are extracted, and they are coded into an intermediate feature vector h₁; features of the relative movement state information (relative longitudinal velocity, relative longitudinal acceleration, and relative vehicle distance) of the two vehicles are extracted by using the same structure, that is, 3 serially connected convolution layers (C_(R1), C_(R2), C_(R3)) and 1 maximum pooling layer (P₂), and they are coded into an intermediate feature vector h₂; and features of the collision risk level and the type of the target vehicle are extracted by using a convolution layer C_(W1) and a maximum pooling layer P₃, and they are coded into an intermediate feature vector h₃. Secondly, the features h₁, h₂ and h₃ are combined and full connection layers FC₄ and FC₅ are connected to output the action decision.

The number of neurons of the convolution layers C_(F1), C_(F2), C_(F3), C_(R1), C_(R2), C_(R3) and C_(W1) is set to be 20, 20, 10, 20, 20, 10 and 20 respectively; the number of neurons of the full connection layers FC₄ and FC₅ is set to be 200. The activation function of each convolution layer and full connection layer is a Rectified Linear Unit (ReLU), and an expression thereof is f(x)=max(0, x).

(2) A Critic Network is Designed.

A Critic network is established by using a neural network with a multiple hidden layer structure. Firstly, a state space S_(t) is input into a hidden layer FC_(C1); and at the same time, an action decision A_(t) is input into a hidden layer FC_(C2). Secondly, the hidden layers FC_(C1) and FC_(C2) are combined by tensor addition. Finally, after passing through the full connection layers FC_(C3) and FC_(C4) sequentially, a value of the Critic network is output.

The number of neurons of the layers FC_(C1) and FC_(C2) is set to be 400, the number of neurons of other hidden layers is set to be 200, and the activation function of each layer is an ReLU.
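The data flow through the Critic network can be sketched in plain Python as follows. The layer widths (400, 400, 200, 200), the ReLU activation and the tensor addition come from the text; the weight initialisation, the zero biases and the final one-neuron output head are assumptions made only so that the sketch produces a scalar value, and all names are hypothetical.

```python
import random


def relu(xs):
    """Element-wise f(x) = max(0, x)."""
    return [x if x > 0.0 else 0.0 for x in xs]


def dense(xs, weights, biases):
    """Fully connected layer; weights holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    # Assumption: small uniform random weights and zero biases; the source
    # does not give an initialisation scheme.
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class Critic:
    def __init__(self, state_dim=9, action_dim=2, seed=0):
        rng = random.Random(seed)
        self.fc_c1 = init_layer(state_dim, 400, rng)   # state branch
        self.fc_c2 = init_layer(action_dim, 400, rng)  # action branch
        self.fc_c3 = init_layer(400, 200, rng)
        self.fc_c4 = init_layer(200, 200, rng)
        self.head = init_layer(200, 1, rng)            # assumed scalar output head

    def value(self, state, action):
        h_s = relu(dense(state, *self.fc_c1))
        h_a = relu(dense(action, *self.fc_c2))
        h = [a + b for a, b in zip(h_s, h_a)]          # tensor addition
        h = relu(dense(h, *self.fc_c3))
        h = relu(dense(h, *self.fc_c4))
        return dense(h, *self.head)[0]
```

Adding the two 400-neuron branches element by element is what requires FC_(C1) and FC_(C2) to have the same width.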

Sub-Step 3: The Backward Anti-Collision Driving Decision-Making Model is Trained.

Gradient updating is performed to the network parameters by using loss functions J_(actor) and J_(critic). A specific training process is as follows:

Sub-step 3.1: the Actor network and the Critic network are initialized.

Sub-step 3.2: iterative solution is performed, wherein each iteration includes sub-step 3.21 to sub-step 3.23 specifically as follows:

Sub-step 3.21: iterative solution is performed, wherein each iteration includes sub-step 3.211 to sub-step 3.213 specifically as follows:

Sub-step 3.211: a movement control operation of the vehicle is obtained by using the virtual traffic environment model in step I.

Sub-step 3.212: sample data (S_(t),A_(t),r_(t)) are obtained by using the Actor network.

Sub-step 3.213: a cycle is ended to obtain a sample point set [(S₁,A₁,r₁), (S₂,A₂,r₂), . . . , (S_(t),A_(t),r_(t))].

Sub-step 3.22: an advantage function is calculated:

$\begin{matrix} {{\hat{F}}_{t} = \sum\limits_{t^{\prime} > t}\gamma^{t^{\prime} - 1}r_{t^{\prime}} - V\left( S_{t} \right)} & (10) \end{matrix}$

in formula (10), $\hat{F}_{t}$ represents an advantage function at time t, V(S_(t)) represents a value function of state S_(t), $\hat{F}_{t}>0$ represents that the possibility of taking a current action should be increased, and $\hat{F}_{t}<0$ represents that the possibility of taking the action should be decreased.
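A short Python sketch of formula (10) follows. It takes the formula as printed, including the strict condition t′>t and the discount exponent t′−1, and assumes 1-based time indices in the formula mapped onto 0-based Python lists; the function name is hypothetical.

```python
def advantages(rewards, values, gamma):
    """Advantage for t = 1..T, following formula (10) as printed.

    rewards[i] and values[i] hold r_{i+1} and V(S_{i+1}): the lists are
    0-based while the time index of the formula is 1-based.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have equal length")
    horizon = len(rewards)
    result = []
    for t in range(1, horizon + 1):
        # Discounted sum over all later steps t' > t.
        future = sum(gamma ** (tp - 1) * rewards[tp - 1]
                     for tp in range(t + 1, horizon + 1))
        result.append(future - values[t - 1])
    return result
```

For rewards [1, 2, 4], values [0.5, 0.5, 0.5] and γ=0.5, the first advantage is 0.5×2+0.25×4−0.5=1.5, the second is 0.25×4−0.5=0.5, and the last is −0.5 because no later rewards remain.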

Sub-step 3.23: iterative solution is performed, wherein each iteration includes sub-step 3.231 to sub-step 3.233 specifically as follows:

Sub-step 3.231: an objective function of the Actor network is calculated.

Sub-step 3.232: the parameters of the Actor network are updated by using the loss function J_(actor):

$\begin{matrix} {J_{actor} = \sum\limits_{(S_{t},A_{t})}\min\left\lbrack p_{t}(\theta){\hat{F}}_{t},\ \mathrm{clip}\left( p_{t}(\theta),1 - \varepsilon,1 + \varepsilon \right){\hat{F}}_{t} \right\rbrack} & (11) \end{matrix}$

in formula (11), p_(t)(θ) represents the ratio of a new policy π_(θ) to an old policy π_(θ_old) on the action decision distribution in a policy updating process, $p_{t}(\theta) = \frac{\pi_{\theta}\left( A_{t} \middle| S_{t} \right)}{\pi_{\theta_{old}}\left( A_{t} \middle| S_{t} \right)}$, clip(⋅) represents a clipping function, and ε is a constant valued as ε=0.25 in the present invention.
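The clipped objective of formula (11) can be illustrated with the following Python sketch. It assumes that the probability ratios p_(t)(θ) and the advantages have already been computed for each sample; the function names are hypothetical.

```python
EPSILON = 0.25  # clipping constant given for formula (11)


def clip(x, low, high):
    """Limit x to the closed interval [low, high]."""
    return max(low, min(high, x))


def actor_objective(ratios, advantages, epsilon=EPSILON):
    """Sum over samples of min(p * F, clip(p, 1 - eps, 1 + eps) * F)."""
    total = 0.0
    for p, f in zip(ratios, advantages):
        unclipped = p * f
        clipped = clip(p, 1.0 - epsilon, 1.0 + epsilon) * f
        total += min(unclipped, clipped)
    return total
```

For ratios [1.5, 0.5, 1.0] and advantages [2.0, −1.0, 3.0], the three terms are min(3.0, 2.5)=2.5, min(−0.5, −0.75)=−0.75 and 3.0, so the objective is 4.75. The clipping keeps a single update from moving the new policy far from the old one.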

Sub-step 3.233: the parameters of the Critic network are updated by using the loss function J_(critic):

$\begin{matrix} {J_{critic} = {- {\sum\limits_{t = 1}^{T}\left\lbrack {{\sum\limits_{t^{\prime} > t}{\gamma^{t^{\prime} - 1}r_{t^{\prime}}}} - {V\left( S_{t} \right)}} \right\rbrack^{2}}}} & (12) \end{matrix}$

Sub-step 3.234: a cycle is ended.

Sub-step 3.3: iterative updating is performed according to the method provided in sub-step 3.2 to make the Actor network and the Critic network converge gradually. In a training process, if the vehicle has a backward collision or rollover, a current round is terminated and a new round for training is started. When the iteration reaches the maximum number of steps or the model is capable of making a backward anti-collision driving decision stably and accurately, the training ends.

Sub-Step 4: The Decision-Making Policy is Output by Using the Backward Anti-Collision Decision-Making Model.

The information obtained by the centimeter-level high-precision differential GPS, the inertia measurement unit, the millimeter wave radar and the CAN bus is input into the trained backward anti-collision driving decision-making model, such that proper steering wheel angle and throttle opening control quantities are capable of being quantitatively output to provide an effective and reliable backward anti-collision driving suggestion for a driver, so as to realize effective, reliable and adaptive backward anti-collision driving decision-making of the heavy commercial vehicle. 

What is claimed is:
1. A backward anti-collision driving decision-making method for a heavy commercial vehicle, wherein the method comprises the following steps:

step I: establishing a virtual traffic environment model: for high-class highways, establishing a virtual traffic environment model, that is, a three-lane virtual environment model comprising straight lanes and curved lanes, wherein the heavy commercial vehicle moves in the traffic environment model, a target vehicle follows the heavy commercial vehicle, and in the process there are 4 different running conditions, comprising acceleration, deceleration, uniform velocity and lane change; in a process of establishing the virtual traffic environment model, vehicle movement state information is obtained in real time through a centimeter-level high-precision differential GPS, an inertia measurement unit and a millimeter wave radar mounted on each vehicle, comprising positions, velocity, acceleration, relative distance and relative velocity of the two vehicles; a type of the target vehicle is obtained in real time through a visual sensor mounted at a rear part of the vehicle; and driver control information is read through a CAN bus, comprising a throttle opening and a steering wheel angle of the vehicle; the target vehicle refers to a vehicle located behind the heavy commercial vehicle on a running road, located within the same lane line, running in the same direction and closest to the heavy commercial vehicle, including 3 types: small, medium and large vehicles;

step II: establishing a backward collision risk assessment model, specifically comprising: firstly, calculating time required for collision between the heavy commercial vehicle and the target vehicle:

$\begin{matrix} {{{RTTC}(t)} = {- \frac{x_{c}(t)}{v_{r}(t)}}} & (1) \end{matrix}$

in formula (1), RTTC(t) represents backward distance collision time at time t in units of seconds, x_(c)(t) represents vehicle distance in units of meters, v_(F)(t) and v_(R)(t) respectively represent the velocities of the heavy commercial vehicle and the target vehicle, v_(r)(t) represents the relative velocity of the two vehicles in units of meters per second, and v_(r)(t)=v_(F)(t)−v_(R)(t);

secondly, calculating a backward collision risk level; when the backward distance collision time is not less than 2.1 seconds and not more than 4.4 seconds, giving a backward collision alarm, indicating that a backward collision early warning system has passed a test; and based on this, quantifying the backward collision risk level:

$\begin{matrix} {\delta_{w} = \frac{{{RTTC}(t)} - 2.1}{4.4 - 2.1}} & (2) \end{matrix}$

in formula (2), δ_(w) represents a quantified value of a backward collision risk; when δ_(w)≥1, it indicates that there is no backward collision risk; when 0.5≤δ_(w)≤1, it indicates that there is a backward collision risk; and when 0≤δ_(w)≤0.5, it indicates that the backward collision risk level is very high;

step III: establishing a backward anti-collision driving decision-making model of the heavy commercial vehicle: comprehensively considering the influence of traffic environment, vehicle operation state, rear vehicle type and backward collision risk level on backward collision, establishing a backward anti-collision driving decision-making model of the heavy commercial vehicle by adopting a PPO algorithm, and performing interactive iterative learning with a target vehicle movement random process model to obtain an optimal backward anti-collision decision, specifically comprising the following 4 sub-steps:

sub-step 1: defining basic parameters of the backward anti-collision driving decision-making model: firstly, describing a backward anti-collision driving decision-making problem as a Markov decision-making process (S,A,P,r) under a certain reward function, wherein S is a state space, A is a backward anti-collision action decision, P is a state transition probability caused by movement uncertainty of the target vehicle, and r is a reward function; and secondly, defining basic parameters of the Markov decision-making process, specifically comprising:

(1) defining a state space: establishing a state space expression by using the vehicle movement state information output in step I and the backward collision risk level output in step II:

S_(t)=(v_(F_lon),a_(F_lon),v_(r_lon),a_(r_lon),θ_(str),p_(thr),L_(r),δ_(w),T_(m))  (3)

in formula (3), S_(t) represents a state space at time t, v_(F_lon) and v_(r_lon) respectively represent the longitudinal velocity of the heavy commercial vehicle and the relative longitudinal velocity of the two vehicles in units of meters per second, a_(F_lon) and a_(r_lon) respectively represent the longitudinal acceleration of the heavy commercial vehicle and the relative longitudinal acceleration of the two vehicles in units of meters per second squared, θ_(str) represents a steering wheel angle of the vehicle in units of degrees, p_(thr) represents a throttle opening as a percentage, L_(r) represents a relative vehicle distance in units of meters, δ_(w) and T_(m) respectively represent the backward collision risk level and the type of the target vehicle, m=1,2,3 respectively represent that the target vehicle is a large vehicle, a medium vehicle and a small vehicle, and T_(m)=m in the present invention;

(2) defining an action decision: in order to comprehensively consider the influence of transverse movement and longitudinal movement on backward collision, defining a driving policy, that is, an action decision output by the decision-making model, by using the steering wheel angle and the throttle opening as control quantities in the present invention:

A_(t)=[θ_(str_out),p_(thr_out)]  (4)

in formula (4), A_(t) represents an action decision at time t, θ_(str_out) represents a normalized steering wheel angle control quantity in a range of [−1, 1], and p_(thr_out) represents a normalized throttle opening control quantity in a range of [0, 1]; and when p_(thr_out)=0, it indicates that the vehicle does not accelerate, and when p_(thr_out)=1, it indicates that the vehicle accelerates at a maximum acceleration;

(3) establishing a reward function: in order to evaluate the advantages and disadvantages of the action decision, establishing a reward function to concretize and digitalize the evaluation; and considering that backward anti-collision driving decision-making is a multi-objective optimization problem involving safety, comfort and other objectives, designing the reward function as follows:

r_(t)=r₁+r₂+r₃  (5)

in formula (5), r_(t) represents a reward function at time t, r₁ represents a safety distance reward function, r₂ represents a comfort reward function, and r₃ represents a penalty function;

firstly, designing a safety distance reward function r₁:

$\begin{matrix} {r_{1} = \left\{ \begin{matrix} {\omega_{d}\left( {L_{r} - L_{s}} \right)} & {L_{r} \leq L_{s}} \\ 0 & {L_{r} > L_{s}} \end{matrix} \right.} & (6) \end{matrix}$

in formula (6), L_(r) and L_(s) respectively represent the relative vehicle distance and a safety distance threshold, and ω_(d) represents a safety distance weight coefficient, valued as ω_(d)=0.85 in the present invention;

secondly, designing a comfort reward function r₂:

r₂=ω_(j)|a_(F_lon)(t+1)−a_(F_lon)(t)|  (7)

in formula (7), ω_(j) is a comfort weight coefficient, valued as ω_(j)=0.95 in the present invention;

finally, designing a penalty function r₃:

$\begin{matrix} {r_{3} = \left\{ \begin{matrix} {- 100,} & \text{collision} \\ {- 100,} & \text{rollover} \\ {0,} & \text{no collision or rollover} \end{matrix} \right.} & (8) \end{matrix}$

(4) designing an expected maximum policy:

$\begin{matrix} {\pi^{*} = {\arg\max\limits_{\pi_{\theta}}E_{\tau(\pi_{\theta})}\left\lbrack {\sum\limits_{t = 0}{\gamma^{t}r_{t}}} \right\rbrack}} & (9) \end{matrix}$

in formula (9), π* is an expected maximum policy, π is a backward anti-collision decision-making policy, γ is a discount factor, γ∈(0,1), and τ(π) represents the trajectory distribution under policy π;

sub-step 2: designing a network architecture of the backward anti-collision driving decision-making model: setting up a backward anti-collision driving decision-making network by using an “Actor-Critic” network framework, comprising an Actor network and a Critic network, wherein the Actor network uses state space information as an input and outputs an action decision, that is, the throttle opening and steering wheel angle control quantities of the heavy commercial vehicle; the Critic network uses the state space information and the action decision as an input, and outputs a value of current “state-action”, specifically comprising:

(1) designing an Actor network: establishing a hierarchical coder structure and respectively extracting features of various information in the state space; firstly, constructing 3 serially connected convolution layers (C_(F1), C_(F2), C_(F3)) and 1 maximum pooling layer (P₁), extracting features of the movement state information (longitudinal velocity, longitudinal acceleration, steering wheel angle, and throttle opening) of the vehicle, and coding them into an intermediate feature vector h₁; extracting features of the relative movement state information (relative longitudinal velocity, relative longitudinal acceleration, and relative vehicle distance) of the two vehicles by using the same structure, that is, 3 serially connected convolution layers (C_(R1), C_(R2), C_(R3)) and 1 maximum pooling layer (P₂), and coding them into an intermediate feature vector h₂; extracting features of the collision risk level and the type of the target vehicle by using a convolution layer C_(W1) and a maximum pooling layer P₃, and coding them into an intermediate feature vector h₃; and secondly, combining the features h₁, h₂ and h₃ and connecting full connection layers FC₄ and FC₅ to output the action decision, wherein the number of neurons of the convolution layers C_(F1), C_(F2), C_(F3), C_(R1), C_(R2), C_(R3) and C_(W1) is set to be 20, 20, 10, 20, 20, 10 and 20 respectively; the number of neurons of the full connection layers FC₄ and FC₅ is set to be 200; the activation function of each convolution layer and full connection layer is a Rectified Linear Unit (ReLU), and an expression thereof is f(x)=max(0, x);

(2) designing a Critic network: establishing a Critic network by using a neural network with a multiple hidden layer structure; firstly, inputting a state space S_(t) into a hidden layer FC_(C1); at the same time, inputting an action decision A_(t) into a hidden layer FC_(C2); secondly, combining the hidden layers FC_(C1) and FC_(C2) by tensor addition; and finally, after passing through the full connection layers FC_(C3) and FC_(C4) sequentially, outputting a value of the Critic network, wherein the number of neurons of the layers FC_(C1) and FC_(C2) is set to be 400, the number of neurons of other hidden layers is set to be 200, and the activation function of each layer is an ReLU;

sub-step 3: training the backward anti-collision driving decision-making model: performing gradient updating to the network parameters by using loss functions J_(actor) and J_(critic), wherein a specific training process is as follows:

sub-step 3.1: initializing the Actor network and the Critic network;

sub-step 3.2: performing iterative solution, wherein each iteration comprises sub-step 3.21 to sub-step 3.23 specifically as follows:

sub-step 3.21: performing iterative solution, wherein each iteration comprises sub-step 3.211 to sub-step 3.213 as follows:

sub-step 3.211: obtaining a movement control operation of the vehicle by using the virtual traffic environment model in step I;

sub-step 3.212: obtaining sample data (S_(t),A_(t),r_(t)) by using the Actor network;

sub-step 3.213: ending a cycle to obtain a sample point set [(S₁,A₁,r₁), (S₂,A₂,r₂), . . . , (S_(t),A_(t),r_(t))];

sub-step 3.22: calculating an advantage function:

$\begin{matrix} {{\hat{F}}_{t} = \sum\limits_{t^{\prime} > t}\gamma^{t^{\prime} - 1}r_{t^{\prime}} - V\left( S_{t} \right)} & (10) \end{matrix}$

in formula (10), $\hat{F}_{t}$ represents an advantage function, V(S_(t)) represents a value function of state S_(t), $\hat{F}_{t}>0$ represents that the possibility of taking a current action should be increased, and $\hat{F}_{t}<0$ represents that the possibility of taking the action should be decreased;

sub-step 3.23: performing iterative solution, wherein each iteration comprises sub-step 3.231 to sub-step 3.233 specifically as follows:

sub-step 3.231: calculating an objective function of the Actor network;

sub-step 3.232: updating the parameters of the Actor network by using the loss function J_(actor):

$\begin{matrix} {J_{actor} = \sum\limits_{(S_{t},A_{t})}\min\left\lbrack p_{t}(\theta){\hat{F}}_{t},\ \mathrm{clip}\left( p_{t}(\theta),1 - \varepsilon,1 + \varepsilon \right){\hat{F}}_{t} \right\rbrack} & (11) \end{matrix}$

in formula (11), p_(t)(θ) represents the ratio of a new policy π_(θ) to an old policy π_(θ_old) on the action decision distribution in a policy updating process, $p_{t}(\theta) = \frac{\pi_{\theta}\left( A_{t} \middle| S_{t} \right)}{\pi_{\theta_{old}}\left( A_{t} \middle| S_{t} \right)}$, clip(⋅) represents a clipping function, and ε is a constant valued as ε=0.25;

sub-step 3.233: updating the parameters of the Critic network by using the loss function J_(critic):

$\begin{matrix} {J_{critic} = - \sum\limits_{t = 1}^{T}\left\lbrack \sum\limits_{t^{\prime} > t}\gamma^{t^{\prime} - 1}r_{t^{\prime}} - V\left( S_{t} \right) \right\rbrack^{2}} & (12) \end{matrix}$

sub-step 3.234: ending a cycle;

sub-step 3.3: performing iterative updating according to the method provided in sub-step 3.2 to make the Actor network and the Critic network converge gradually, wherein in a training process, if the vehicle has a backward collision or rollover, a current round is terminated and a new round for training is started; and when the iteration reaches the maximum number of steps or the model is capable of making a backward anti-collision driving decision stably and accurately, the training ends;

sub-step 4: outputting the decision-making policy by using the backward anti-collision decision-making model: inputting the information obtained by the centimeter-level high-precision differential GPS, the inertia measurement unit, the millimeter wave radar and the CAN bus into the trained backward anti-collision driving decision-making model, such that proper steering wheel angle and throttle opening control quantities are capable of being quantitatively output to provide an effective and reliable backward anti-collision driving suggestion for a driver, so as to realize effective, reliable and adaptive backward anti-collision driving decision-making of the heavy commercial vehicle.